ABSTRACT


One of the essential problems in educational data mining
is to predict students’ performance on future learning
materials, such as problems, assignments, and quizzes.
Pioneering algorithms for predicting student performance
mostly rely on two sources of information: students’ past
performance and the domain knowledge model of the learning materials.
The domain knowledge model, traditionally curated by do- 
main experts, maps learning materials to concepts, topics, 
or knowledge components that are presented in them. How- 
ever, creating a domain model by manually labeling the 
learning material can be a difficult and time-consuming task. 
In this paper, we propose a tensor factorization model for 
student performance prediction that does not rely on a pre- 
defined domain model. Our proposed algorithm models stu- 
dent knowledge as a soft membership of latent concepts. It 
also represents the knowledge acquisition process with an 
added rank-based constraint in the tensor factorization
objective function. Our experiments show that the proposed
model outperforms state-of-the-art algorithms in predicting
student performance on two real-world datasets, and is
robust to hyper-parameters.


Keywords 
student modeling, predicting student performance, tensor 
factorization 


1. INTRODUCTION 


The popularity of online learning services and massive open 
online courses has led to extensive growth in the amount 
of student activity and learning data. As the number of
students and learning materials increases in these online
systems, the need for automatic sense-making from this data,
educational data mining, becomes more evident. One of the 
important tasks in educational data mining is accurately 
predicting students’ performance (PSP). PSP can be used
in early detection of high-risk students that may fail or quit
a class, in class evaluation and course planning activities,
and in learning material recommendation to students.


Thanh-Nam Doan and Shaghayegh Sahebi "Rank-Based Tensor
Factorization for Predicting Student Performance" In:
Proceedings of The 12th International Conference on
Educational Data Mining (EDM 2019), Collin F. Lynch, Agathe
Merceron, Michel Desmarais, & Roger Nkambou (eds.) 2019, pp.
288 - 293


Shaghayegh Sahebi
Department of Computer Science
University at Albany - SUNY
Albany, NY
ssahebi@albany.edu


Many successful PSP techniques aim to predict students’ 
performance in a problem by modeling their state of knowl- 
edge in different concepts required by that problem. To do 
this, pioneering and recent PSP techniques rely on the
availability of a domain knowledge model that maps problems
to concepts [19, 5, 25]. However, given the vast scope of 
learning materials in today’s online learning systems, such 
domain knowledge models may not be available. Ideally, a 
PSP model should be able to work without requiring such a 
predefined map. 


Additionally, a successful data mining model for PSP should
be capable of considering specific characteristics of the
student learning process: (a) that students gain knowledge of
concepts over time by practicing different problems, (b)
that they may forget some of the gained knowledge, (c) that
this knowledge gain is a gradual process, and (d) that
learning can happen differently for different students, on
different problems, and at different times. Finally, to provide
better insight to students and teachers, such a model should
also be interpretable with respect to these characteristics.
Previous research covers only some of the requirements
above.


In this paper, we propose a student performance prediction
model, Rank-Based Tensor Factorization (RBTF), that
considers all of the above requirements. To model student
sequences on problems, we represent their scores over time as a
three-dimensional tensor. To avoid the need for a predefined
domain knowledge model, we propose a tensor factorization
model for PSP that maps problems and student knowledge
into a lower-dimensional “latent” concept space. Representing
student knowledge in this lower-dimensional space leads to a 
soft-membership approach that provides more flexibility by 
avoiding strict assignment of student knowledge to discrete 
“knowledge states”. By learning student, problem, and time- 
based biases in this model we take into account the differ- 
ences between students, problems, and times in the learning 
process. To capture the gradual learning requirement, we
impose a rank-based constraint on student knowledge
variables, which allows for occasional forgetting of concepts but
imposes a generally positive learning trend.


In our experiments, we study the proposed model in com- 
parison with two state-of-the-art baseline PSP algorithms, 
on two real-world datasets. Our experiments show that our
model performs better than both baselines in the task of
predicting student performance. We also examine the
sensitivity of our model’s performance to its
hyper-parameters.


Paper Outline. The remainder of the paper is organized
as follows. Section 2 provides a brief literature review of the
related work. Section 3 describes our model (RBTF) and the
parameter learning steps. Section 4 extensively evaluates
RBTF against other baselines on two real datasets. Lastly,
Section 5 concludes the paper and suggests some directions
for future work.


2. RELATED WORK 


Many pioneering solutions to the problem of predicting
student performance are based on either regression models [19]
or Bayesian knowledge tracing (BKT) [5]. Regression-based
models, such as performance factors analysis (PFA), try to
predict students’ performance using a pre-defined domain
model that maps learning material to knowledge
components [19]. PFA, which is based on learning factors
analysis [4], takes into account prior successes and failures of a
student on knowledge components associated with the cur- 
rent problem. 


BKT is a constrained two-state hidden Markov model that 
models student knowledge in each knowledge component 
(KC) as two binary states: “known” and “unknown”. It 
learns the probability of transitioning between these two 
states, and probabilities of students’ success and failure in 
each KC, given their state of knowledge. Despite being suc- 
cessful in PSP for certain datasets, this model, in its origi- 
nal form, does not consider continuous states of knowledge 
or soft membership to knowledge states. Moreover, BKT
does not capture the relationships between KCs, and is not
personalized for individual students. BKT also
relies on a pre-defined domain model. Recently, new
BKT-based models have aimed to address some of these problems [2,
9, 29]. For example, Pardos and Heffernan have addressed
BKT’s non-personalized modeling in [18, 17]. Song et al.
proposed PSFK in [25] to address PSP when students
encounter a knowledge component for the first time. However, these
models rely on labeled problem knowledge components or
concepts. Later, Gonzalez-Brenes and Mostow proposed a
topical hidden Markov model that jointly learns the domain 
model and predicts student performance [10, 8]. However, 
this model has two restricting assumptions: that at each at- 
tempt, the student works on one skill of a problem, and that 
the students do not forget any acquired skills. 


Recently, other approaches inspired by recommender
systems research and factorization models have been used for
PSP. Despite being successful, these approaches are not
tailored specifically to educational data mining problems,
since they do not explicitly model student learning as a
learning gain process. The matrix-factorization approaches
in this area do not consider student sequences and only rely 
on a snapshot of student performance. For example, Thai- 
Nghe and Schmidt-Thieme proposed a multi-relational fac- 
torization student model that considers multiple relations 
between students and tasks, but does not consider student 
sequences [27]. Later, Nedungadi and Smruthy proposed
a similar multi-relational matrix factorization approach
exploring the effect of modeling biases [16]. Sahebi et al. also
proposed another multi-relational learning approach that 
learned student performance according to canonical corre- 
lation analysis [22]. Non-negative matrix factorization has 
been used to improve performance predictions [28]. Pero et 
al. compared collaborative filtering techniques for the task 
of PSP in a small dataset [20]. Elbadrawy et al. predict
student performance using students’ interactions with the learning
management system to achieve a higher accuracy [7].


Some other recommender system-based approaches consider 
student sequence, but do not explicitly model knowledge 
gain in students. For example, Thai-Nghe et al. explored 
different factorization models, including tensor factorization, 
to predict student performance [26]. Sahebi et al. [23] stud- 
ied educational data mining methods, such as PFA and 
BKT, with matrix and tensor factorization approaches, from 
the recommender systems literature, for PSP. Almutairi et 
al. have used tensor and coupled-matrix factorization to pre- 
dict course-based student performance [1]. However, their 
tensor decomposition models do not explicitly model stu- 
dents’ knowledge gain. 


Although there has been some promising research on PSP
that considers student sequences without requiring a domain
model, these approaches have limitations. For example,
SPARse Factor Analysis (SPARFA) by Lan et al., which uses
Kalman filters to jointly learn the domain model, student
knowledge, and the underlying question difficulties, can be
very expensive to learn because of its large state space [12].
Sahebi et al. have proposed a feedback-driven tensor
factorization algorithm that can model students’ gradual knowledge
acquisition [24]. However, their model has a strict constraint
that does not allow students to forget concepts.
Lindsey et al. proposed a non-parametric Bayesian
technique that can refine expert-labeled skills. However,
they simplify the problem by finding coarse-grained skills,
as they restrict each problem to have exactly one skill [14].
In this paper, we propose a tensor factorization model for 
predicting student performance that does not require do- 
main knowledge, models problems and student knowledge 
as soft-membership of latent concepts, and can model stu- 
dent sequences and gradual knowledge increase. 


3. RANK-BASED TENSOR 
FACTORIZATION (RBTF) 


Here we present our model, rank-based tensor factorization, 
by which we aim to predict students’ performance in prob- 
lems, considering their performance sequence and knowledge 
growth. Our proposed model is inspired by the recommender
systems domain. We chose a recommender systems-based
model for two main reasons: a) student
performance similarities, and b) problem similarities. First,
we consider that students with similar knowledge levels will
perform similarly in solving problems. Second, we assume
that a student will have similar performance on two
problems with similar concepts. Recommender-based
factorization models consider these two expectations. However, as
discussed in the introduction, a successful student
model needs to include additional considerations. One of
these is that knowledge gain is a gradual process for
students, which happens over time. As students interact with
learning materials, such as problems, they learn from them.
To represent this time-based process, we model students’
activity sequences as a series of attempts on problems. To
represent the student performance data according
to these assumptions, we represent student sequences in a
three-dimensional tensor ($\mathcal{Y}$) that has student,
problem, and time (attempt) dimensions. Each cell $y_{a,s,p}$ in this
tensor represents student $s$’s score in problem $p$, which she
has chosen to study at attempt $a$.
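As a rough illustration of this representation, the sketch below stores the observed cells of the score tensor sparsely in Python. The function names and the toy records are hypothetical and not part of the original work, and attempts are indexed from 0 here purely for convenience.

```python
def build_score_tensor(records):
    """Store observed cells sparsely as {(attempt, student, problem): score}."""
    tensor = {}
    for attempt, student, problem, score in records:
        tensor[(attempt, student, problem)] = float(score)
    return tensor


def student_sequence(tensor, student):
    """Return one student's (attempt, problem, score) triples, ordered by attempt."""
    rows = [(a, p, y) for (a, s, p), y in tensor.items() if s == student]
    return sorted(rows)


# Toy data (assumed): two students, two problems, scores already in [0, 1].
records = [(0, "s1", "p1", 0.4), (1, "s1", "p2", 0.7), (0, "s2", "p1", 0.9)]
Y = build_score_tensor(records)
```

A sparse dictionary is used because each student attempts only one problem per attempt index, so most cells of the full tensor are unobserved.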


The core idea of the aforementioned assumptions is the no- 
tion of “concepts”: gradual learning can be viewed as gain- 
ing knowledge in course concepts; student knowledge-based 
similarities are based on how much they mastered each of 
the concepts; and problem similarities are defined on how 
their represented concepts are shared. However, in many 
online educational systems, concepts are undefined and dif- 
ficult to measure. In these systems, there are no “observed” 
features defined as problem concepts or knowledge compo- 
nents. Hence, we propose to discover shared “latent” fea- 
tures between students and problems as representatives for 
the notion of concepts. We model each problem as a vector 
of k latent concepts, that shows the importance of each la- 
tent concept in that problem. Also, we model each student’s 
knowledge at any time point as another vector of the same 
latent concepts. 


We assume that student $s$’s performance on problem $p$ at
time $a$ is a result of her existing knowledge in the latent
concepts required by the problem. Accordingly, we model the
estimated student score $\hat{y}_{a,s,p}$ as a dot product between the
problem’s latent concept vector $q_p$ and the student’s knowledge in
those concepts $t_{a,s}$:


$\hat{y}_{a,s,p} \approx t_{a,s} \cdot q_p$ (1)


To maintain the interpretability of our model, we enforce
latent variables in $q_p$ to be non-negative. Here, by choosing
the number of concepts ($k$) less than the number of problems
and students, we are representing students and problems in
a lower-dimensional latent space that can better capture
student and problem similarities (the student and problem
similarity assumptions above). However, the model in Equation 1 does not consider
differences in factors such as student ability, problem
difficulty, or student cohort strength. For example, students’
average score in a more difficult problem is expected to be
less than their average score in an easier problem. To address
this issue, we add student, problem, and attempt biases ($b_s$,
$b_p$, and $b_a$), in addition to an overall cohort bias ($\mu$), to our
model:


$\hat{y}_{a,s,p} \approx t_{a,s} \cdot q_p + b_s + b_p + b_a + \mu$ (2)
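The biased prediction rule can be sketched in a few lines of Python. This is a minimal illustration only; the function name, argument order, and example numbers are assumptions made here, with $k = 3$ latent concepts.

```python
def predict_score(t_as, q_p, b_s, b_p, b_a, mu):
    """Estimated score: dot product of the student's knowledge vector and the
    problem's concept vector, plus student, problem, attempt, and cohort biases."""
    interaction = sum(t * q for t, q in zip(t_as, q_p))
    return interaction + b_s + b_p + b_a + mu


# Hypothetical values for one student, one problem, one attempt.
example = predict_score(
    t_as=[0.5, 0.2, 0.0],   # student knowledge at this attempt
    q_p=[1.0, 0.5, 2.0],    # non-negative concept weights of the problem
    b_s=0.05, b_p=-0.1, b_a=0.0, mu=0.3,
)
```

A negative problem bias, as in this example, corresponds to a problem that is harder than average.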


To learn the parameters of this problem ($T$, $Q$, $b_s$, $b_p$, $b_a$,
and $\mu$), we minimize the objective function in Equation 3.
The first component calculates the squared difference
between observed student scores and estimated student scores.
The last three components regularize the biases,
student knowledge, and problem concepts for generalizability
purposes.


$\mathcal{L}_1 = \sum_{a,s,p} (\hat{y}_{a,s,p} - y_{a,s,p})^2 + \lambda (b_s^2 + b_p^2 + b_a^2) + \lambda_1 \|t_{a,s}\|^2 + \lambda_2 \|q_p\|^2$ (3)


The model in Equation 2 does not address our gradual
learning assumption for students. To capture this gradual
learning, we can assume that a student’s knowledge ($t_{a,s}$)
increases over time. However, we should also note that this
knowledge increase depends on the problems that the student
selects to solve and the concepts presented in them. As a
result, we can translate this knowledge increase into an
increase in estimated student scores in problems ($t_{a,s} \cdot q_p$). In
other words, we expect student $s$’s predicted score at
attempt $a$ to be at least as large as her score at attempt $a - 1$:


$t_{a,s} \cdot q_p - t_{a-1,s} \cdot q_p \geq 0$


In reality, this knowledge increase can be non-monotonic.
For example, a student may forget some concepts after a
while. For this reason, we propose to use a rank-based
model for student knowledge gain that allows knowledge
loss to happen for students, but penalizes it. Using this
rank-based model, we aim to maximize the difference between the
aggregation of all students’ scores on all questions at each
attempt versus the attempts before that. Hence, we would
like to maximize $\mathcal{L}_2$ in Equation 4. Here, $\sigma(\cdot)$ is the sigmoid
function, defined as $\sigma(x) = 1/(1 + e^{-x})$. The sigmoid function
is selected because of its superiority in rank-based
recommendation systems [21, 6]. The term
$\log(\sigma(t_{a,s} \cdot q_p - t_{j,s} \cdot q_p))$ rewards ranking student $s$’s
score at attempt $a$ higher than her score at an earlier
attempt $j$, with $j < a$.


$\mathcal{L}_2 = \sum_{a} \sum_{j=1}^{a-1} \sum_{s} \sum_{p} \log(\sigma(t_{a,s} \cdot q_p - t_{j,s} \cdot q_p))$ (4)
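To make the rank-based term concrete, the following sketch evaluates the sum of log-sigmoid differences over all attempt pairs with $j < a$, all students, and all problems. It is an illustrative reading of the text, not the authors' code: the dictionary layout (`T` keyed by `(attempt, student)`, `Q` keyed by problem) and the function names are assumptions.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rank_objective(T, Q):
    """Sum log(sigmoid(t_a . q - t_j . q)) over every student, problem, and j < a."""
    total = 0.0
    for (a, s), t_later in T.items():
        for (j, s_other), t_earlier in T.items():
            if s_other == s and j < a:
                for q in Q.values():
                    total += log_sigmoid(dot(t_later, q) - dot(t_earlier, q))
    return total
```

Because the log-sigmoid is finite for negative arguments, a drop in predicted score between attempts lowers the objective without being forbidden, which is how occasional forgetting is tolerated.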


To capture the dynamics between all assumptions, we
combine the minimization of $\mathcal{L}_1$ in Equation 3 and maximization
of $\mathcal{L}_2$ in Equation 4. Our final objective is to minimize the
loss function in Equation 5. The hyper-parameter $\omega$
controls the relative strictness of knowledge increase versus
the importance of having a more accurate estimate of
student performance.


$\mathcal{L} = \sum_{a,s,p} (\hat{y}_{a,s,p} - y_{a,s,p})^2 + \lambda_1 \|t_{a,s}\|^2 + \lambda_2 \|q_p\|^2 + \lambda (b_s^2 + b_p^2 + b_a^2) - \omega \sum_{a} \sum_{j=1}^{a-1} \sum_{s} \sum_{p} \log(\sigma(t_{a,s} \cdot q_p - t_{j,s} \cdot q_p))$ (5)

Learning the Parameters: By using the stochastic gradient
descent algorithm to minimize $\mathcal{L}$, we find student
knowledge of each latent concept at any point, the importance
of each latent concept in each problem, and an estimate of
the student score in each problem at any attempt. Recall that
the parameters that we want to infer are $T$, $Q$, $b_s$, $b_p$, $b_a$,
and $\mu$. For the cohort bias $\mu$, we assign the average score
of all students on all problems [11], i.e.,


$\mu = \frac{\sum_{a,s,p} \mathbb{I}(a,s,p) \, y_{a,s,p}}{\sum_{a,s,p} \mathbb{I}(a,s,p)}$,


where $\mathbb{I}(a,s,p)$ is an indicator function returning 1 if the
tuple $(a, s, p)$ is in our training set, and 0 otherwise.
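One possible stochastic gradient step is sketched below, together with the cohort bias computed as the mean observed training score. The text does not spell out the update rules, so the gradients, the learning rate `lr`, pairing each observed cell with a single earlier attempt `j`, and the projection of $q_p$ onto non-negative values are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cohort_bias(train):
    """Average observed training score; train maps (a, s, p) -> score."""
    return sum(train.values()) / len(train)


def sgd_step(key, y, j, T, Q, b_s, b_p, b_a, mu,
             lr=0.05, lam=0.001, lam1=0.1, lam2=0.1, omega=0.5):
    """Update parameters in place for one observed cell and one earlier attempt j.

    Pass j=None to skip the rank term, for example at a student's first attempt.
    """
    a, s, p = key
    t_a, q = T[(a, s)], Q[p]
    t_j = T[(j, s)] if j is not None else None
    err = dot(t_a, q) + b_s[s] + b_p[p] + b_a[a] + mu - y
    # d/dd of -omega * log(sigmoid(d)) is -omega * (1 - sigmoid(d)); g is its magnitude.
    g = omega * (1.0 - sigmoid(dot(t_a, q) - dot(t_j, q))) if t_j is not None else 0.0
    for c in range(len(q)):
        ta_c, q_c = t_a[c], q[c]
        tj_c = t_j[c] if t_j is not None else ta_c
        t_a[c] = ta_c - lr * (2 * err * q_c + 2 * lam1 * ta_c - g * q_c)
        # Projection keeps problem-concept weights non-negative.
        q[c] = max(0.0, q_c - lr * (2 * err * ta_c + 2 * lam2 * q_c - g * (ta_c - tj_c)))
        if t_j is not None:
            t_j[c] = tj_c - lr * g * q_c
    b_s[s] -= lr * (2 * err + 2 * lam * b_s[s])
    b_p[p] -= lr * (2 * err + 2 * lam * b_p[p])
    b_a[a] -= lr * (2 * err + 2 * lam * b_a[a])
```

In a full training loop one would iterate over shuffled training tuples for several epochs and pick `j` among the student's earlier attempts; those details are left out of this sketch.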


4. EXPERIMENTS 


In the following, we evaluate our model in comparison with 
two state-of-the-art methods in the task of predicting stu- 
dent performance. Further, we analyze how our solution 
models students’ learning process by looking at students’ 
knowledge gain in course concepts. Finally, we examine
RBTF’s sensitivity to various hyper-parameter
settings.


4.1 Dataset and Experiment Setup 

For experiments, we use the Canvas Network dataset¹, which
is available online [3]. Canvas Network hosts many freely
available open online courses. In addition to learning
modules, each course can have different types of assignments,
discussions, and quizzes. In this platform, participants are
not limited to a specific sequence of learning material or
assignments. The dataset is anonymized such that student
IDs, course names, discussion contents, submission contents,
and course contents are not available.


Dataset  | #students | #problems | #attempts | Avg. attempts
Course 1 | 531       | 91        | 87        | 29.92
Course 2 | 2597      | 32        | 30        | 12.73


Table 1: Dataset statistics.


We select two courses in Canvas Network and denote them as
Course 1 and Course 2. The selected courses have the largest number
of quizzes in the whole dataset. We consider each quiz as
a problem in our model. Quizzes are graded between zero
and a maximum possible score. For consistency, we
normalize the quiz grades to between zero and one. Table 1 shows
the statistics of these two courses. As shown in the table,
Course 2 has more students but fewer problems and
attempts than Course 1.


The data of each course is represented as a list of tuples
(attempt, student id, quiz id, grade). We randomly assign
80% of the tuples to training and the remaining 20% to
testing.
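A random 80/20 split over such tuples might look like the following sketch. The function name, the fixed seed, and the rounding of the cut point are assumptions; the text does not say how the random split was implemented.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle (attempt, student id, quiz id, grade) tuples and cut them once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy records (assumed): one student attempting ten quizzes.
records = [(a, "s1", "quiz%d" % a, 0.5) for a in range(10)]
train, test = split_records(records)
```

Splitting at the tuple level means that a student's later attempts can land in training while earlier ones land in testing.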


Hyper-parameter Setting: In the performance prediction
experiments (Section 4.2), we set $\omega = 0.5$, $\lambda_1 = \lambda_2 = 0.1$,
and the bias regularization $\lambda = 0.001$. The number of
concepts $k$ is set to 3.


4.2 Student Performance Prediction 
In this section, we compare the prediction performance of
RBTF with that of two baselines.


Baselines: To compare the prediction performance, we em- 
ploy the following two baselines: 


¹ http://canvas.net


• Feedback-Driven Tensor Factorization (FDTF) [24]: It
is a tensor factorization model specifically tailored to
predict students’ performance. It considers students’
gradual learning process. However, its hard constraint
on knowledge increase in students limits its modeling
capacity: it does not allow concepts to be
forgotten by students. It also does not include
biases.


• SPARse Factor Analysis (SPARFA) [13]: SPARFA is
a probabilistic factor analysis approach that calculates
the probability of a student’s correct response to a
problem. It does not require a predefined domain
knowledge model. However, it does not consider
students’ sequences. To adapt it to our problem, we use
its probability scores in place of the predicted student
grade.


Metrics: We use two measures to evaluate the performance 
prediction task. Since our main goal is to predict student 
scores or grades, we would like to measure how close our 
predictions are to the actual student scores. To do this, we 
use the root mean squared error (RMSE). The lower the 
value of RMSE, the better the model. 


Since many performance prediction models focus on predict- 
ing students’ success and failure as a binary value, instead of 
their score [13, 5], we also employ accuracy for performance 
comparison. To do this, we regard scores greater than 0.5 
as success and the rest as failure. Unlike RMSE, the higher 
the value of accuracy, the better the model. 
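Both metrics are simple to compute, as in the sketch below. Applying the 0.5 threshold to the observed scores as well as to the predictions is an assumption about how the binary labels are derived, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(y_true, y_pred, threshold=0.5):
    """Share of cases where observed and predicted scores fall on the same side
    of the threshold (scores strictly above it count as success)."""
    hits = sum((t > threshold) == (p > threshold) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Note that a score of exactly 0.5 counts as a failure under this rule, following the "greater than 0.5" wording.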


         |           RMSE           |         Accuracy
Dataset  | RBTF   | FDTF   | SPARFA | RBTF   | FDTF  | SPARFA
Course 1 | 0.12   | 0.27   | 0.59   | 92.5%  | 85.2% | 81.7%
Course 2 | 0.2056 | 0.2116 | 0.567  | 95.24% | 92.8% | 87.41%


Table 2: Prediction Performance.


Results: Table 2 shows the prediction performance of our 
model (RBTF) and the two baselines (FDTF and SPARFA) 
on the two datasets. As we can see, both tensor factorization 
models (RBTF and FDTF) perform better than SPARFA 
in both courses. This shows the importance of considering 
student sequences in predicting their performance. Also, 
we can see that RBTF performs better than FDTF in both 
courses. This shows that, even though modeling sequential 
knowledge increase in students is important, this increase 
should not be strictly monotonic and should be flexible to 
allow for occasional forgetting of concepts. 


4.3 Hyper-parameter Sensitivity Analysis 

In this section, we study RBTF’s sensitivity to
hyper-parameter values. First, we examine how the balance between
training error on student performance fitting ($\mathcal{L}_1$ in
Equation 3) and modeling student knowledge increase ($\mathcal{L}_2$ in
Equation 4) affects the generalizability of our model. To do this,
we measure the test error while varying the hyper-parameter $\omega$,
which controls this balance in Equation 5. Then, we capture
the effect of the number of concepts on RBTF’s performance
by varying $k$ in Equation 5 and measuring its error on test
data.


         |               $\omega$
Dataset  | 0.01  | 0.25  | 0.5  | 0.75  | 1.0
Course 1 | 0.191 | 0.128 | 0.12 | 0.137 | 0.141
Course 2 |       |       |      |       |


Table 3: RMSE with different values of $\omega$; the number
of concepts is 3.


Sensitivity to $\omega$: Recall that $\omega$ controls the trade-off
between having an accurate estimation of student performance
and the constraint of knowledge increase. A larger value
of $\omega$ gives the knowledge increase constraint
more weight in RBTF, and a smaller value of $\omega$
dictates a stricter fit of student performance to the training
data. We use different values of $\omega$ from 0 to 1 and measure
RBTF’s RMSE corresponding to these values. For other
parameters, we use the default values mentioned in Section 4.1.
Table 3 presents the performance of RBTF with different
values of $\omega$ on the two datasets. From the table, we observe
that $\omega = 0.5$ yields the best performance of RBTF, and this is
consistent across the two datasets. However, the results on the
Course 2 dataset are more sensitive to changes in $\omega$. One
reason for this can be the smaller number of attempts and
greater sparsity of the Course 2 dataset, compared to the Course 1
dataset, which can lead to easier overfitting to training data.


         |                $k$
Dataset  | 3      | 5     | 10     | 15
Course 1 | 0.12   | 0.122 | 0.127  | 0.128
Course 2 | 0.2056 | 0.206 | 0.2065 | 0.2065


Table 4: RMSE with different numbers of
concepts and $\omega = 0.5$.


Sensitivity to k: Recall that, in our model, concepts are 
latent lower-dimensional representations of student perfor- 
mance and problems over attempts. They can be used to 
model the similarity between students and problems. To 
measure the effect of the number of concepts $k$, we vary the
value of $k$ while using the default values for other
parameters (see Section 4.1). We measure the RMSE of RBTF
for each value of $k$. Table 4 shows the results. From the table, we
observe that increasing the value of k makes RBTF perform 
slightly worse. This finding is consistent in both datasets. 
However, RBTF is relatively robust to k as this increase in 
error is minor. 


5. CONCLUSION AND FUTURE WORK 

In this paper, we proposed a novel rank-based tensor factor- 
ization method (RBTF), which is able to predict the perfor- 
mance score of students by considering the gradual learning 
of students as a ranking problem. Our model has the
flexibility to represent student knowledge as a soft membership of
latent concepts, only requires activity sequences of students,
and discovers an individualized student knowledge model
including biases. Our extensive evaluations show that RBTF
outperforms state-of-the-art baselines in both root mean
squared error and accuracy measures. Also, we show our
model’s robustness to hyper-parameters by varying
the balance between knowledge ranking and performance
fitting parts of the model, and by varying the number of
latent concepts.


There are several directions in which to extend this research
further. In this work, we experiment on performance
prediction within the same course. This model can be used
to experiment on between-course performance predictions.
Another application of our model is to detect knowledge
gaps in students and recommend useful learning materials to
them. Moreover, contingent on the availability of a domain
knowledge model, this work can be extended to improve
the existing domain knowledge model. Recent studies show
that the order and length of students’ activities are essential for
understanding students’ performance [15]. Integrating
these features could therefore enhance the prediction performance of
our model.